DETAILED ACTION

Remarks 

1. In response to communications filed on 3 November 2006, claims 1-5 are
amended, and claims 6-7 are added per applicant's request. Claims 1-7 are pending in 
the application. 

2. The amendments to the specification filed 3 November 2006 have been entered. 



Claim Rejections - 35 USC § 101 

3. 35 U.S.C. 101 reads as follows: 

Whoever invents or discovers any new and useful process, machine, manufacture, or composition of 
matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the 
conditions and requirements of this title. 

4. Claims 1-7 are rejected under 35 U.S.C. 101 because the claimed invention is 
directed to non-statutory subject matter. 

The claims do not recite a practical application by producing a physical transformation 
or producing a useful, concrete, and tangible result. To perform a physical 
transformation, the claimed invention must transform an article or physical object into a 
different state or thing. Transformation of data is not a physical transformation. A 
useful, concrete, and tangible result must be either specifically recited in the claim or 
flow inherently therefrom. To be useful the claimed invention must establish a specific, 
substantial, and credible utility. To be concrete the claimed invention must be able to 
produce the same results given the same initial starting conditions. To be tangible the 
claimed invention must produce a practical application or real world result. In this case 
the claims fail to produce a useful or tangible result. As to usefulness, there is no 
claimed result of the computations recited in the claims. As to tangibility, there is no
claimed real world result or output. 

Simply generating an indication is neither useful nor tangible, as there is nothing
being done with that indication. In addition to this, the method can branch, either by
finding a string that is less than the threshold edit distance value or stopping edit
distance calculation of a string that is found to be greater than the threshold (see steps
(e) and (f) of claim 1). As step (f) may always occur, it is possible that there will never
be a result. A useful and tangible result must exist in all possible branches the method
will follow.



Claim Rejections - 35 USC § 112 

5. The following is a quotation of the second paragraph of 35 U.S.C. 112: 

The specification shall conclude with one or more claims particularly pointing out and distinctly 
claiming the subject matter which the applicant regards as his invention. 

6. Claims 1 and 3 are rejected under 35 U.S.C. 112, second paragraph, as being
indefinite for failing to particularly point out and distinctly claim the subject matter which 
applicant regards as the invention. 

Claim 1 recites the limitation "forbearing from the indicating" in line 23. Claim 1 
also recites the phrases "et seq." in lines 35-36 and 41. These limitations are unclear. 

Claim 3 recites the limitation "a the lowest cell of the individual column" in line 5. 
This is ambiguous, as 'a' is indefinite and signifies the introduction of a new element, 
while 'the' is definite and implies antecedent basis. 

Claim Rejections - 35 USC § 103

7. The following is a quotation of 35 U.S.C. 103(a) which forms the basis for all 
obviousness rejections set forth in this Office action: 

(a) A patent may not be obtained though the invention is not identically disclosed or described as set 
forth in section 102 of this title, if the differences between the subject matter sought to be patented and 
the prior art are such that the subject matter as a whole would have been obvious at the time the 
invention was made to a person having ordinary skill in the art to which said subject matter pertains. 
Patentability shall not be negatived by the manner in which the invention was made. 

8. Claims 1-2 and 6-7 are rejected under 35 U.S.C. 103(a) as being unpatentable
over Zien et al. (US Patent 6,556,984) in view of Tomikawa et al. (US Pre-Grant
Publication 2002/0072863).

As to claim 1, Zien et al. teaches:
(a) obtaining the search string (see 2:39-48).
Zien et al. does not teach obtaining a threshold value.
Tomikawa et al. teaches obtaining the threshold value (see paragraph [0198]);
Zien et al. as modified teaches (b) selecting a first text from the list of texts as a
present computation text (see Zien et al. 2:39-48);

(c) computing, column-by-column, a grid of edit distance values between the
search string and the present computation text (see Zien et al. 4:48-5:11);

(d) stopping the computing in response to computing a column whose minimum
value of edit distance is at least the threshold value (see Zien et al. 4:48-5:11, a table of
edit distances is created; 6:42-50, a tree with depth first search can be used. Also see
Tomikawa et al. paragraph [0198], which teaches pruning a branch of a search tree if a
threshold is passed);

(e) in response to completing the computing and the computed edit distance from
the present computation text to the search string being below the threshold value,
generating an indication that the edit distance of the present computation text from the
search string is less than the threshold value (see Zien et al. 9:38-39, a cost is returned.
In addition to this, Tomikawa et al. prunes search paths that are greater than the
threshold value);

(f) in response to either stopping the computing or completing the computing and 
the edit distance from the present computation text to the search string not being below 
the threshold value, forbearing from the indicating (see Tomikawa et al. paragraph
[0198]. Search paths that are greater than the threshold value are pruned and not 
completed); 

(g) in response to completing the computing, selecting a next text in the list after 
the present computation text, as the present computation text (see Zien et al . 4:18-30); 

(h) in response to stopping the computing, selecting a next text, in the list after 
the present computation text, that does not share with the present computation text a 
prefix corresponding to columns of the grid up to and including the column whose 
minimum value of edit distance is at least the threshold value, as the present 
computation text (see Zien et al. 7:40-65. A tree is created that corresponds to the grid,
and depth first search is used on the tree. Also see Tomikawa et al. paragraph [0198],
which teaches pruning a branch of a search tree if a threshold is passed. Pruning will 
result in precluding as a result any nodes beneath the node that has passed the 
threshold); 

(i) in response to step (h), returning to steps (c) et seq. (see Zien et al. 4:18-30
and 6:55-7:19);

(j) in response to step (g), returning to steps (c) et seq., but reusing in step (c)
columns of the grid computed for previous said computation text that correspond to any
prefix shared by the previous computation text and the present computation text (see
Zien et al. 5:21-49); and

(k) continuing to perform steps (c) et seq. until selecting reaches an end of the
text list (see Zien et al. 5:12-20 and 6:65-7:19).

Therefore, it would have been obvious to one of ordinary skill in the art at the
time the invention was made to have modified Zien et al. by the teaching of Tomikawa
et al., since Tomikawa et al. teaches that "an object of the invention is to provide
method and apparatus capable of automatically extracting and evaluating mutually
coinciding or similar portions between sequences of atoms or atomic groups in
molecules such as protein molecules in accordance with a simple processing
mechanism" (see paragraph [0021]).
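
For illustration only, steps (c) through (f) of claim 1 as characterized above can be sketched in Python. This is a minimal sketch of a column-by-column edit distance computation that stops early; the function name `bounded_edit_distance` is hypothetical, the code is not taken from Zien et al., Tomikawa et al., or the application, and the prefix-sharing reuse of steps (h) through (j) is omitted.

```python
def bounded_edit_distance(search, text, threshold):
    """Return the edit distance if it is below `threshold`, otherwise None.

    Hypothetical helper: one grid column is computed per character of `text`.
    """
    # Column 0: distance from each prefix of `search` to the empty text prefix.
    column = list(range(len(search) + 1))
    for j, text_char in enumerate(text, start=1):
        new_column = [j]
        for i, search_char in enumerate(search, start=1):
            cost = 0 if search_char == text_char else 1
            new_column.append(min(column[i] + 1,          # extra character in text
                                  new_column[i - 1] + 1,  # extra character in search
                                  column[i - 1] + cost))  # match or substitution
        column = new_column
        # Step (d): the column minimum never decreases, so stop once it
        # reaches the threshold; step (f): no indication is generated.
        if min(column) >= threshold:
            return None
    distance = column[-1]
    # Step (e): indicate only when the completed distance is below the threshold.
    return distance if distance < threshold else None
```

Returning `None` stands in for "forbearing from the indicating"; a caller iterating over a sorted text list would use that outcome to decide which text to select next.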

As to claim 2, Zien et al. as modified teaches further comprising:
ordering the text list in a sequence to place texts with shared prefixes adjacent
one to another in the sequence (see Figures 3A-3B and 6:65-7:19).

As to claim 6, Zien et al. as modified teaches further comprising:
prior to step (b), sorting the texts in the list in lexicographical order (see Figures
3A and 3B, and 6:65-7:19).

As to claim 7, Zien et al. as modified teaches wherein:
computing comprises using dynamic programming to perform the computing
(see Zien et al. 1:62-67).

Response to Arguments 

9. Applicant's arguments filed 3 November 2006 have been fully considered but 
they are not persuasive. 

In response to applicant's argument that the claimed invention is directed to
non-statutory subject matter, Examiner notes that "generating an indication" is neither
tangible nor useful, as there is nothing being done with the indication. Examiner also
notes that Applicant cites element (e) of independent claim 1 as a concrete, useful, and
practical result. However, element (e) only occurs when computing is completed.
Therefore the result of "generating an indication" will not always occur. In element (f),
"forbearing from the indicating" is not a useful or tangible result. Therefore, a useful and
tangible result does not exist in every branch the program could take.

Applicant's arguments with respect to claims 1-2 have been considered but are 
moot in view of the new ground(s) of rejection. 

Conclusion

10. Applicant's amendment necessitated the new ground(s) of rejection presented in 
this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP 
§ 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 
CFR 1.136(a). 

A shortened statutory period for reply to this final action is set to expire THREE 
MONTHS from the mailing date of this action. In the event a first reply is filed within 
TWO MONTHS of the mailing date of this final action and the advisory action is not 
mailed until after the end of the THREE-MONTH shortened statutory period, then the 
shortened statutory period will expire on the date the advisory action is mailed, and any 
extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of
the advisory action. In no event, however, will the statutory period for reply expire later 
than SIX MONTHS from the date of this final action. 

Any inquiry concerning this communication or earlier communications from the
examiner should be directed to Charles D. Adams whose telephone number is (571) 
272-3938. The examiner can normally be reached on 8:30 AM - 5:00 PM, M - F. 

If attempts to reach the examiner by telephone are unsuccessful, the examiner's 
supervisor, Charles Rones, can be reached on (571) 272-4085. The fax phone number
for the organization where this application or proceeding is assigned is 571-273-8300. 

Information regarding the status of an application may be obtained from the 
Patent Application Information Retrieval (PAIR) system. Status information for 
published applications may be obtained from either Private PAIR or Public PAIR. 
Status information for unpublished applications is available through Private PAIR only. 
For more information about the PAIR system, see http://pair-direct.uspto.gov. Should 
you have questions on access to the Private PAIR system, contact the Electronic 
Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a 
USPTO Customer Service Representative or access to the automated information 
system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. 

ABSTRACT

This paper argues that for some applications direct search
for association rules can be more efficient than the two-stage
process of the Apriori algorithm, which first finds large
itemsets which are then used to identify associations. In
particular, it is argued, Apriori can impose large computational
overheads when the number of frequent itemsets is
very large. This will often be the case when association rule
analysis is performed on domains other than basket analysis
or when it is performed for basket analysis with basket
information augmented by other customer information. An
algorithm is presented that is computationally efficient for
association rule analyses during which the number of rules to
be found can be constrained and all data can be maintained
in memory.

Categories and Subject Descriptors 

H.2.8 [Database Management]: Database Applications - data mining;
I.2.6 [Artificial Intelligence]: Learning; H.3.3
[Information Storage and Retrieval]: Information Search
and Retrieval

General Terms 

Association Rule, Search 

1. INTRODUCTION

The Apriori algorithm [2] and its derivatives [15, 11, 17]
have become the de facto standard for discovering association
rules. This paper presents an alternative approach to
association rule discovery that may be more efficient when
all data can be retained in memory and the number of candidate
itemsets cannot be adequately constrained by considering
individual itemsets in isolation. Given the current
availability of very large memory machines, many potential
applications of the new algorithm may satisfy the first
constraint. Many data miners will consider their time more
valuable than the cost of a few extra gigabytes of memory.

The Apriori algorithm relies on constraining the number
of itemsets by considering features of itemsets in isolation,
most commonly by placing a lower limit on the frequency
of an itemset, below which itemsets will not be considered.
This is often feasible for simple basket analysis, as few
combinations of products will be bought together in large
quantities. However, even for basket analysis, the number of
frequent itemsets may rapidly increase if simple basket analysis
is augmented by considering socio-economic or other
attributes of the customers. Augmenting simple basket analysis
in this way can add much to the richness of the knowledge
gained. However, if a customer description attribute
is common to 50% of the customer base then that attribute
will occur frequently with a large number of item combinations.
Add a number of such attributes to the analysis and
the number of frequent itemsets can rapidly expand to an
extent where application of Apriori becomes infeasible.

The same problem occurs when association rule analysis is
applied to domains other than basket analysis. Association
rules can be a very valuable tool for discovering interesting
inter-relationships between variables in many different types
of domain, as they do not filter through a machine learning
bias the rules that are presented to the user. This enables
the user to identify the interesting rules rather than relying
on a machine learning system to determine the rules of
interest.

This paper describes how a search algorithm can take advantage
of inter-association-rule constraints to find association
rules efficiently.

2. BACKGROUND 

Early approaches to identifying interesting rules from data
were dominated by attempts to form small sets of rules for
accurate classification of further previously unsighted data
[9, 7, 13]. For the most part, borrowing from an elegant
characterization of mining optimized rules by Bayardo and
Agrawal [3], this activity can be characterized as follows:

• A training set is a finite set of records where each
record is an element to which we apply Boolean predicates
called conditions.

• A rule consists of two conditions or combinations of
conditions (typically conjunctions or, less frequently,
disjunctions) called the antecedent and consequent. A
rule with antecedent A and consequent C is denoted
as A -> C.

• The search is limited to exploring rules that have as 
consequent the values of a distinguished attribute, called 
the class attribute. 

• The search seeks a set of rules that optimize some 
function of quality. The search is usually incremental, 
adding one rule at a time. The quality function usu- 
ally attempts (often indirectly) to trade-off complexity 
against errors on the training set. 

• In consequence, the rules selected tend to have an- 
tecedents that select subsets of the training set that 
are strongly dominated by a single class variable. 

In the nineties, this research program took two divergent 
branches. On one hand, a number of researchers explored 
techniques for identifying large numbers of classification rules 
[4, 8, 10, 12, 14, 16]. This work was distinguished by the 
removal of the objective of using the rules for classification 
and hence of the requirement that a small number of rules 
be identified. Rather, all rules that satisfied some criterion 
of interestingness were sought. Interestingness was usually 
evaluated by some measure that led to identification of rules 
for which the antecedent identified subsets of the training set 
that were dominated by a single value of the class attribute, 
the intent being to predict the occurrence of that value. 

The other branch was association rule discovery [1]. As- 
sociation rule discovery differs in intent from most other 
rule discovery paradigms. While the other paradigms have 
concentrated on finding rules that are predictive of a sin- 
gle, preselected, class variable, association rule discovery has 
been motivated by finding rules that predict increased fre- 
quency of an attribute value, or collection of attribute val- 
ues, without limitation on the values that may appear in 
the consequent of a rule. Association rule discovery can be 
distinguished by the aims of 

• discovering all rules that satisfy a given set of con- 
straints, 

• an emphasis on processing large training sets, and 

• allowing any available condition to appear as either an 
antecedent or consequent. 

Due to its emphasis on analysis of large datasets, association 
rule discovery has concentrated on algorithms that process 
data via database access whereas the other branches of rule 
discovery have tended to concentrate on algorithms that re- 
tain all data in memory. This has led to the development of 
very different forms of algorithm. Association rule discov- 
ery algorithms have sought to minimize the number of passes 
through the data due to the very high time overheads that 
these imply when accessing a database. This is less of a 
concern when data is retained in memory. 

Recent research has started to bring these two divergent
branches of rule discovery research back together. Bayardo
and Agrawal [3] present a variant of the OPUS search algorithm
[18], developed in the context of classification rules
research, to discover key rules of the type sought by association
rule discovery. However, as is typical in classification
rules research, their technique considers only the search
space for a single consequent at a time, limiting its applicability
in the most common association rule activity, market
basket analysis, where it is often desirable to consider every
product as a possible candidate rule consequent.

This paper presents techniques for employing the OPUS 
search algorithm for rule discovery where the search space 
encompasses rules for which the antecedent can contain any 
conjunction of available predicates and the consequent can 
be any single predicate. It is further distinguished by the 
ability to efficiently find a prespecified number of rules that 
maximize an arbitrary function measuring rule quality. This 
distinguishes the approach from typical association rule al- 
gorithms that explore all rules that satisfy prespecified con- 
straints. This distinction is particularly significant. For 
dense search spaces, typical rule constraints may result in 
numbers of itemsets that make the Apriori approach infea- 
sible. The ability to restrict search to a predefined number 
of target rules can allow the new algorithm to efficiently 
process such search spaces. 

A major concern in developing association rule algorithms 
has been minimizing the number of database accesses that 
are required. I contend that the need to do this is reduced 
if the database is retained in main memory. I further con- 
tend that doing so is now feasible for a large range of data 
mining tasks due to the increase in the availability of very 
large memory computers. However, I recognize that there 
will always remain some tasks for which it is not feasible to 
retain a sufficient sample of cases in memory for acceptable 
association rule discovery. The techniques explored in this 
paper do not address that scenario. 

2.1 The Apriori algorithm 

The Apriori algorithm discovers association rules in two
steps, utilizing the concept of an itemset. An itemset is
a conjunction of conditions (footnote 1). A large itemset is an
itemset that occurs more frequently than a predefined minimum
frequency. The Apriori algorithm exploits the observation
that many common measures of the value of an association
rule are functions of the frequency of LHS, RHS, and
LHS ∧ RHS, where LHS and RHS represent, respectively,
the itemsets for the antecedent and consequent of the
association rule.

The two top-level steps of the Apriori algorithm are: 

1. Find all large itemsets. 

2. Generate association rules from the large itemsets. 

The first stage plays two roles:

1. limiting the number of rules that need be explored to
those for which the union of the LHS and RHS occurs
with sufficient frequency, and

2. caching the relevant information about those itemsets,
specifically their frequency, so that the search for
association rules need not repeatedly access the database
to compute them.

1 In basket analysis the relevant conditions are predicates,
one for each of the available items, each of which is true iff
the corresponding item was purchased by a customer, hence
the name itemset.

This strategy can be very successful at reducing the number
of passes through the database. Indeed, variants of the
approach can reduce database access to two passes [15, 17].
However, where there are numerous large itemsets, the overheads
of itemset maintenance and manipulation can severely
impact upon the computational feasibility of the approach.
A dramatic illustration of this is provided in Section 3,
below. In this example, applying Apriori with standard
settings to the Cover Type data set, with just 120 items,
but with many items occurring very frequently, results in
14,567,892 large itemsets. With so many large itemsets,
management and manipulation of those itemsets creates a
large computational burden.
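
The two top-level steps described in this section can be sketched in Python as follows. This is a minimal in-memory sketch, not the implementation of [2]: the function names, the `min_count` and `min_confidence` parameters, and the toy transactions are assumptions made for illustration, and an itemset is treated as large when its count is at least `min_count`.

```python
from itertools import combinations

def large_itemsets(transactions, min_count):
    """Step 1: level-wise search; returns {itemset: count} for all large itemsets."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    counts = {}
    k = 1
    level = [frozenset([i]) for i in items]
    while level:
        frequent = {}
        for candidate in level:
            n = sum(1 for t in transactions if candidate <= t)
            if n >= min_count:
                frequent[candidate] = n
        counts.update(frequent)  # cache frequencies for step 2
        k += 1
        # Candidates of size k: every (k-1)-subset must itself be large.
        unions = {a | b for a in frequent for b in frequent if len(a | b) == k}
        level = [c for c in unions
                 if all(frozenset(s) in frequent for s in combinations(c, k - 1))]
    return counts

def rules_from_itemsets(counts, n, min_confidence):
    """Step 2: derive rules LHS -> RHS from the cached counts only."""
    rules = []
    for itemset, both in counts.items():
        for size in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), size):
                lhs = frozenset(lhs)
                rhs = itemset - lhs
                confidence = both / counts[lhs]
                lift = confidence / (counts[rhs] / n)
                if confidence >= min_confidence:
                    rules.append((lhs, rhs, confidence, lift))
    return rules

# Toy data, assumed for illustration.
transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk"}]
```

Every large itemset is held in `counts`, which is the caching role described above and also the source of the overhead when the number of large itemsets is very high.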

2.2 The OPUS search algorithm 

The move towards search for large numbers of classifica- 
tion rules resulted in the development of algorithms for effi- 
cient traversal of the search spaces involved. These initially 
relied upon assigning an arbitrary order to the conditions 
which was then used to structure the search space so that 
each combination of conditions was considered just once. A 
search space with four conditions (a, b, c, and d) structured
in this manner is presented in Fig. 1. 

Such a search space is exponential in size. If there are 10,000
conditions, a figure commonly exceeded in market basket
analysis, the search space size is $2^{10,000}$. Clearly it will
only be possible to explore such a search space if it can be
pruned. Under fixed structure search, algorithms typically
seek branches that cannot contain a solution (footnote 2), and prune
those branches. Fig. 2 demonstrates the effect of pruning
the branch for condition c from the fixed-structure search
space illustrated in Fig. 1. As can be seen, this removes
only one node from the search space.
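
A small Python sketch can make the fixed-structure search space concrete. It assumes the tree of Fig. 1 uses the order a, b, c, d, with each node extended only by conditions later in that order; the function name `expand` is hypothetical.

```python
def expand(node, start, conditions):
    """Yield `node` and every node below it in the fixed-structure tree."""
    yield node
    for k in range(start, len(conditions)):
        # A child adds one condition that comes later in the fixed order,
        # so each combination of conditions is generated exactly once.
        yield from expand(node + (conditions[k],), k + 1, conditions)

conditions = ("a", "b", "c", "d")
all_nodes = list(expand((), 0, conditions))

# The branch below node (c): only d comes after c in the fixed order.
branch_below_c = list(expand(("c",), 3, conditions))[1:]
```

Under these assumptions `all_nodes` holds all 2^4 combinations, and `branch_below_c` holds the single node (c, d), which is why pruning that branch removes so little.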

The identification of branches to be pruned requires pruning
rules. These identify regions of the search space that cannot
contain a solution. In rule discovery search, many pruning
rules consider, for a given node N, whether any search node
in the space below N that contains a given condition C can
be a solution. Thus, the pruning illustrated in Fig. 2 may
have resulted from a pruning rule identifying that no node
containing c may contain a solution. In this case, the ideal
outcome would be the removal from the search space of all
nodes containing c, as illustrated in Fig. 3. As can be seen,
this approximately halves the remaining search space (footnote 3).

2 What constitutes a solution will depend upon the search
objective. For example, in association rule discovery, a solution
might be any set of conditions that is a frequent itemset.
3 It does not exactly halve the remaining search space as the 
root node has already been visited, as, depending upon the 
search technique, may have the node containing c, and hence 
these nodes should not be counted as part of the remaining 
search space. 



An elegant method of achieving this outcome is to reorder
the search space so that any condition to be pruned at a
node precedes all conditions not to be pruned. This is the
OPUS^s strategy [18]. This algorithm guarantees that every
pruning action approximately halves the remaining search
space. The OPUS^o algorithm [18] extends OPUS^s for
optimization search, using a heuristic that reorders the search
space to maximize the amount of the space associated with
the least promising search operator (footnote 4). This is illustrated in
Fig. 4. The OPUS algorithms have been demonstrated to
support efficient complete search of a number of standard
rule discovery search tasks [18].

A further approach to pruning is provided by inclusive prun- 
ing [19]. Whereas the (exclusive) pruning actions illustrated 
above involve excluding from the search space those nodes 
containing a particular condition, inclusive pruning results 
in the exclusion of all nodes that do not contain a given con- 
dition. Like exclusive pruning, each inclusive pruning action 
approximately halves the remaining search space. 
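
The effect of exclusive pruning under a restructured search space can be sketched as follows. This is a simplified illustration rather than the OPUS algorithms of [18]: `opus_nodes` and the `prune` callback are hypothetical names, the heuristic reordering by promise is omitted, and a pruned condition is simply withheld from every node below the point where it is pruned.

```python
def opus_nodes(current, available, prune):
    """Enumerate search nodes, dropping pruned conditions from all subtrees.

    `prune(current, c)` is an assumed pruning rule: it returns True when no
    node below `current` that contains condition `c` can be a solution.
    """
    yield current
    kept = [c for c in available if not prune(current, c)]
    for k, c in enumerate(kept):
        # Only unpruned conditions are passed down, so a pruned condition
        # disappears from every sibling subtree, not just from its own branch.
        yield from opus_nodes(current | {c}, kept[k + 1:], prune)

conditions = ["a", "b", "c", "d"]
unpruned = list(opus_nodes(frozenset(), conditions, lambda node, c: False))
without_c = list(opus_nodes(frozenset(), conditions, lambda node, c: c == "c"))
```

With four conditions, pruning c at the root leaves only the combinations of a, b, and d, which is the halving behaviour the text describes.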

2.3 Efficient search for association rules 

Many frequent itemsets will relate to association rules that 
are not of interest. This might be addressed by placing ad- 
ditional constraints upon the itemsets that are considered. 
It is possible, although computationally expensive, to take 
account of the relationship between the antecedent and con- 
sequent of association rules that might be derived from an 
itemset, such as the potential lift (footnote 5). However, this would
require duplicating during the first stage much of the work 
of the second stage of the Apriori algorithm. More impor- 
tantly, it is not possible to impose constraints that rely on 
relationships between association rules, such as only find- 
ing itemsets that could participate in the 1000 association 
rules with the highest lift. It will often be the case that the 
end users to receive the association rule reports will only 
be interested in considering a limited number of association 
rules. Selecting a prespecified number of those that maxi- 
mize a particular measure will be desirable from the user's 
perspective and can be used to constrain a directed search 
for association rules. 

Search for association rules can be tackled as a search pro- 
cess that starts with general rules (rules with one condition 
on the LHS) and searches through successive specializations 
(rules formed by adding additional conditions to the LHS). 
Such search is unordered. That is, the order in which suc- 
cessive specializations are added to a LHS is not significant. 
A ∧ B ∧ C -> X is the same as C ∧ B ∧ A -> X. An important
component of efficient search in this context is minimizing 
the number of association rules that need be considered. A 
key technique used to eliminate potential association rules 
from consideration is optimistic pruning. Optimistic prun- 
ing operates by forming an optimistic evaluation of the high- 
est rule value that may occur in a region of the search space. 

4 In rule search each condition can be considered a search 
operator. Formally, the search operator is the inclusion of 
the condition in the set of conditions associated with a node. 
5 Lift is a frequently utilized measure of association rule utility.
The lift of an association rule is
$\frac{|LHS \wedge RHS| / |LHS|}{|RHS| / n}$,
where |X| is the number of cases with conditions X and
n is the total number of cases in the data set.
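
As a worked illustration, lift can be computed from the four counts it depends on. The sketch assumes the standard definition of lift, the confidence of the rule divided by the overall frequency of the RHS; the function name and the counts in the example are made up.

```python
def lift(count_lhs_and_rhs, count_lhs, count_rhs, n):
    """Lift of LHS -> RHS from case counts (assumed standard definition)."""
    confidence = count_lhs_and_rhs / count_lhs  # |LHS and RHS| / |LHS|
    baseline = count_rhs / n                    # |RHS| / n
    return confidence / baseline

# Hypothetical counts: 1000 cases, 200 with the LHS, 100 with the RHS, 50 with both.
example = lift(50, 200, 100, 1000)
```

A value above 1 means the RHS is more frequent among cases covered by the LHS than in the data set as a whole.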

Figure 1: A fixed-structure search space

Figure 2: Pruning a branch from a fixed-structure search space

Figure 3: Pruning all nodes containing a single operator from a fixed-structure search

Figure 4: Pruning with a restructured search space



An optimistic evaluation is one that cannot be lower than 
the actual maximum value. If the optimistic value for a re- 
gion is lower than the lowest value that can be of interest, 
then that region can be pruned. If search seeks the top m 
association rules, then it can maintain a list of the top m 
rules encountered so far during the search. If an optimistic 
evaluation is lower than the lowest value of a rule in the top 
m, then the corresponding region of the search space may 
be pruned. Other pruning rules may identify regions that 
can be pruned because they can contain only rules that fail 
to meet prespecified constraints such as: 

• minimum support (the frequency in the data of the 
RHS or of the RHS and LHS in combination); 

• minimum lift (as defined in footnote 5); or

• being one of the top m association rules on some spec- 
ified criteria. 
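
Maintaining the top m rules and testing an optimistic evaluation against them can be sketched with a min-heap. The class and method names below are hypothetical, and how an optimistic value is computed for a region of the search space is left to the caller.

```python
import heapq

class TopRules:
    """Keeps the m highest-valued rules seen so far (illustrative sketch)."""

    def __init__(self, m):
        self.m = m
        self._heap = []   # min-heap of (value, insertion order, rule)
        self._added = 0

    def lowest(self):
        """Lowest value that can still be of interest."""
        if len(self._heap) < self.m:
            return float("-inf")  # table not full: nothing can be excluded yet
        return self._heap[0][0]

    def can_prune(self, optimistic_value):
        """A region is prunable if its optimistic value is below the m-th best."""
        return optimistic_value < self.lowest()

    def add(self, value, rule):
        self._added += 1
        entry = (value, self._added, rule)
        if len(self._heap) < self.m:
            heapq.heappush(self._heap, entry)
        else:
            # Push the new rule, then drop whichever rule now has the lowest value.
            heapq.heappushpop(self._heap, entry)

    def best(self):
        return [(v, r) for v, _, r in sorted(self._heap, reverse=True)]
```

Because the lowest value in a full table can only rise as better rules arrive, a region that is prunable at one point in the search stays prunable later.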

I use the term credible rule to denote association rules for 
which, at some given point in a search, it is possible that 
the rule will be of interest, using whatever criteria of interest 
apply for the given search. 

If we restrict association rules to have a single condition on 
the RHS, two search strategies are plausible, 

1. for each potential RHS condition explore the space of 
possible LHS conditions; or 

2. for each potential LHS combination of conditions ex- 
plore the space of possible RHS conditions. 



The former strategy leads to the most straight-forward
implementation as it involves a simple iteration through a
straight-forward search for each potential RHS condition.
However, this implies accessing the count of the number of
cases covered by the LHS many times, once for each RHS
condition for which an LHS is considered. At the very least
this entails the computational overheads of caching information.
At the worst it requires a pass through the data each
time the value is to be utilized. While a pass through the
data has lower overheads when the data is stored in memory
rather than on disk, it is still a time consuming operation
that must be avoided if computation is to be efficient.

These considerations militate in favor of the second strategy.
We systematically explore the space of possible LHS
condition combinations, searching from the general to the
specific. During this process we track the set of conditions
that can appear on the RHS of a credible rule in
the search beyond the current point. We then organize the
search to attempt to minimize the number of LHS condition
combinations that are explored. A single pass through
the data can be performed for every LHS combination during
which all statistics are collected for both the LHS and
each of the RHS conditions currently under consideration.
We prune from the search space any regions of potential
LHSs for which optimistic evaluation can ascertain no RHS
can result in a credible rule. The relative efficiency of this
approach against the Apriori approach will depend on the
cost of a pass through the data (lower favoring the new direct
search), the number of frequent itemsets (lower favoring
Apriori), and the number of LHS combinations that must
be explored (lower favoring direct search).

Table 1 displays the algorithm that results from applying
the OPUS search algorithm [18] to obtain efficient search for
this search task. The algorithm is presented as a recursive
procedure with three arguments,

CurrentLHS: the set of conditions in the LHS of the rule
currently being considered.

AvailableLHS: the set of conditions that may be added to
the LHS of rules to be explored below this point

AvailableRHS: the set of conditions that may appear on
the RHS of a rule in the search space at this point and
below



The initial call to the procedure sets CurrentLHS to {}, and 
AvailableLHS and AvailableRHS to the sets of conditions 
that are to be considered on the LHS and RHS of association 
rules, respectively. 
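
As a rough illustration of the single data pass described for the second strategy, the following Python sketch counts, in one scan, the cases covered by a candidate LHS and, for each RHS condition under consideration, the cases covered by the LHS and that RHS together. The representation of each case as a set of conditions and the name scan_counts are assumptions made for this example; they are not taken from the algorithm's published implementation.

```python
def scan_counts(data, lhs, rhs_candidates):
    """One pass over `data`: count cases covered by `lhs`,
    and by `lhs` together with each candidate RHS condition."""
    lhs = frozenset(lhs)
    lhs_count = 0
    joint = {q: 0 for q in rhs_candidates}
    for case in data:              # each case is the set of conditions that hold for it
        if lhs <= case:            # the case satisfies every LHS condition
            lhs_count += 1
            for q in joint:
                if q in case:
                    joint[q] += 1
    return lhs_count, joint


# Toy data, invented for illustration only.
data = [frozenset(c) for c in ({"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"})]
lhs_count, joint = scan_counts(data, {"a"}, ["b", "c"])
print(lhs_count, joint)
```

Because the LHS count is obtained in the same scan as every joint count, it never needs to be cached or recomputed for individual RHS conditions.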

Step 2(c)iiA records each credible association rule as it is 
evaluated. If the search seeks the m best rules on some 
metric, once m rules have been added at this step, as new 
rules are added, the rule with the lowest value on the metric 
can be removed from the table of best rules. A rule will 
not be credible if it fails other constraints, such as minimal 
strength, or, once the table is full, has lower value on the
evaluation metric than the worst rule in the table of best
rules.

Table 1: The OPUS search algorithm adjusted for
search for association rules

Algorithm: OPUS_AR(CurrentLHS, AvailableLHS,
AvailableRHS)

com CurrentLHS is the set of conditions in the LHS
of the rule currently being considered.

com AvailableLHS is the set of conditions that may
be added to the LHS of rules to be explored
below this point

com AvailableRHS is the set of conditions that
may appear on the RHS of a rule in the search
space at this point and below

1. SoFar := {}

2. FOR EACH P in AvailableLHS

   (a) NewLHS := CurrentLHS ∪ {P}

   (b) AvailableLHS := AvailableLHS - P

   (c) IF pruning rules cannot determine that
       ∀x ⊆ AvailableLHS: ∀y ∈ AvailableRHS:
       ¬credible(x ∪ NewLHS → y) THEN

       i. NewAvailableRHS := AvailableRHS
       ii. FOR EACH Q in AvailableRHS

           A. IF credible(NewLHS → Q) THEN

              record NewLHS → Q

           B. IF pruning rules determine that
              ∀x ⊆ AvailableLHS: x = {} ∨
              ¬credible(x ∪ NewLHS → Q) THEN

              NewAvailableRHS :=
              NewAvailableRHS - Q
       iii. IF NewAvailableRHS ≠ {} THEN
            OPUS_AR(NewLHS, SoFar,
            NewAvailableRHS)
       iv. SoFar := SoFar ∪ {P}

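
The recursive procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation evaluated in this paper: "credible" is taken to mean meeting a minimum strength, the only pruning rules used are a minimum LHS coverage (assumed greater than zero) and the corresponding minimum coverage for LHS and RHS together, rules whose consequent already appears in the LHS are skipped, and a max_lhs_size limit is added. The names opus_ar, search and cover are hypothetical.

```python
def opus_ar(data, conditions, min_lhs_cover, min_strength, max_lhs_size):
    """Return (lhs, rhs, strength) for every credible rule found."""
    n = len(data)
    rules = []

    def cover(conds):
        # Proportion of cases for which every condition in `conds` holds.
        # For clarity this rescans the data on each call; the paper instead
        # gathers all statistics for one LHS in a single pass.
        return sum(1 for case in data if conds <= case) / n

    def search(current_lhs, available_lhs, available_rhs):
        so_far = []                                      # step 1
        for p in available_lhs:                          # step 2
            new_lhs = current_lhs | {p}                  # step 2(a)
            lhs_cover = cover(new_lhs)
            if lhs_cover < min_lhs_cover:                # step 2(c): P is pruned and
                continue                                 # never added to so_far
            new_available_rhs = []                       # step 2(c)i, built up here
            for q in available_rhs:                      # step 2(c)ii
                if q in new_lhs:
                    continue                             # assumption: skip trivial rules
                joint = cover(new_lhs | {q})
                if joint / lhs_cover >= min_strength:    # step 2(c)iiA
                    rules.append((new_lhs, q, joint / lhs_cover))
                if joint >= min_lhs_cover * min_strength:  # step 2(c)iiB keeps Q
                    new_available_rhs.append(q)
            if new_available_rhs and len(new_lhs) < max_lhs_size:
                search(new_lhs, list(so_far), new_available_rhs)  # step 2(c)iii
            so_far.append(p)                             # step 2(c)iv

    search(frozenset(), list(conditions), list(conditions))
    return rules


# Toy data, invented for illustration only.
data = [frozenset(c) for c in ({"a", "b", "c"}, {"a", "b"}, {"a", "b", "c"}, {"c"})]
rules = opus_ar(data, ["a", "b", "c"], min_lhs_cover=0.5,
                min_strength=0.9, max_lhs_size=2)
found = {(tuple(sorted(lhs)), rhs) for lhs, rhs, _ in rules}
print(sorted(found))
```

Passing so_far rather than the full set of available conditions to the recursive call is what ensures each LHS combination is visited at most once, and that a pruned condition P is excluded from all later siblings.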

Step 2c prunes conditions from the space of those explored 
on the LHS of a rule. Rather than exploring the space of pos- 
sible LHS sets beyond the current one, optimistic techniques 
with low computational overheads should be employed. For 
example, if |CurrentLHS ∪ {P}| is less than minimum
support then no rule in the relevant space of possible rules
can achieve minimum support as all are specializations of
|CurrentLHS ∪ {P}| and hence cannot have higher
support.

Step 2(c)iiB prunes conditions from the space of those ex- 
plored on the RHS of a rule. Optimistic rules with low 
computational overheads should be employed here also. For 
example, if |CurrentLHS ∪ {P}| = 0 then no credible rule 
will exist in the relevant space of possible rules. 

For both of the pruning steps, the exact pruning rules to be 
employed will depend upon the specific constraints for the 
search. 

Without pruning this algorithm will systematically explore 
the entire search space. The pruning step removes from the 
search space below a node all and only those rules contain- 
ing the identified condition. It follows, therefore, that the 
algorithm is complete, always finding the target association 
rules, so long as the pruning rules employed are correct. 

This algorithm is based on OPUS^s rather than OPUS^o. This 
is because the more efficient OPUS^o requires at least two 
passes through the available LHS conditions at each node of 
the search tree, one to select and sort the LHS conditions 
and the second to make the recursive call for each LHS with 
the appropriate second and third arguments. The overheads 
of doing this are excessive for this search task because an 
evaluation of which RHS conditions should be retained for 
each LHS would need to be performed in both loops. If 
there are a very large number of potential RHS conditions, 
either calculating this each time or caching the information 
between loops will have very high overheads. For example, 
if there are 1,000 conditions then there might be 1,000 LHSs 
for each of which 1,000 potential RHS values need to be con- 
sidered. Examining each of the resulting 1,000,000 possible 
combinations twice would clearly be undesirable as would 
caching such a large number of values. Thus, a single pass 
approach is employed that sacrifices the efficiencies to be 
gained from dynamic reordering on optimistic value but de- 
livers far greater efficiency in processing a search node than 
would otherwise be possible. 

3. AN EXAMPLE 

The largest dataset in the UCI machine learning repository,
the Cover Type data set, was subjected to association rule
analysis using both the Apriori algorithm and the above
OPUS search. A data set from the machine
learning repository was used instead of one from the UCI
KDD repository due to ease of access by the researcher.
The Cover Type data set was already in a format that could
be directly employed by both the Apriori and OPUS search
software without further data manipulation.

The Cover Type data set was collected for the purpose of 
predicting forest cover type from cartographic variables only 
[5]. However, it is quite conceivable that association rule 
analysis might also detect interesting inter-relationships be- 
tween those cartographic variables in addition to between 
them and the variable describing the forest cover. 581,012 
cases are described by 55 attributes. The ten continuous val- 
ued attributes were discretized into three sub-ranges with as 
close as possible to equal numbers of cases within each
sub-range. The remaining 45 attributes were all binary. In
consequence there were 120 attribute-values, each of which was
treated as a separate condition for association rule analysis
purposes. Note that this treatment results in many frequent
items, as for each binary attribute at least one value must
occur for ≥ 50% of the cases.
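
The equal-frequency discretization described above can be sketched as follows. The helper names equal_frequency_cuts and discretize, the choice of cut points, and the sample values are assumptions for illustration; the paper does not specify how ties or boundaries were handled.

```python
import bisect


def equal_frequency_cuts(values, n_bins=3):
    """Cut points that split `values` into n_bins groups of near-equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(k * n) // n_bins] for k in range(1, n_bins)]


def discretize(value, cuts):
    """Index of the sub-range that `value` falls into (0 .. len(cuts))."""
    return bisect.bisect_right(cuts, value)


values = [5, 1, 9, 3, 7, 2]          # invented sample of one continuous attribute
cuts = equal_frequency_cuts(values)
bins = [discretize(v, cuts) for v in values]
print(cuts, bins)

# Ten attributes with three sub-ranges each, plus 45 binary attributes
# with two values each, give the number of conditions quoted in the text.
n_conditions = 10 * 3 + 45 * 2
print(n_conditions)
```

Assigning bins by cut point, rather than by rank, keeps equal values in the same sub-range, at the cost of slightly unequal bin sizes when there are many ties.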

The publicly available apriori system developed by Borgelt 
[6] was applied to the Cover Type dataset. This implemen- 
tation of Apriori generates rules with a single RHS condi- 
tion and multiple LHS conditions, thus exploring the same 
space of rules as the OPUS based algorithm. It generated 
14,567,892 itemsets when employed with its default settings 
(maximum itemset size of 5; minimum coverage of 10% of 
the data for the LHS of a rule; minimum strength of 80%). 
The coverage of a set of conditions is the proportion of the 
training set for which the conditions are true. The strength 
of an association is the coverage of the union of the LHS and 
RHS divided by the coverage of the LHS. From the mini- 
mum LHS coverage and strength apriori can determine that 
only itemsets with coverage of 8% or higher need be gener- 
ated. This required 96 hours and 44 minutes CPU time on 
a 350MHz PHI linux computer. It was not possible to com- 
plete the generation of all association rules as the file size 
limit was exhausted after 30,677,279 rules were generated. 

The OPUS_AR algorithm was then applied to the same data on
the same computer. The same search space was explored to 
find the top 1000 associations on lift. 

Four pruning rules were employed. To describe these we use 
the following abbreviations. 

• cover(s) is the coverage of the set of conditions s, the 
proportion of the training set for which the conditions 
in s are all true. 

• strength(LHS → RHS) is the strength of association
rule LHS → RHS. strength(LHS → RHS) =
cover(LHS ∪ RHS) / cover(LHS).
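
These measures can be computed directly from their definitions, as in the sketch below. Lift is defined in a footnote outside this section; the form used here, strength divided by the coverage of the RHS, is inferred from the optimistic lift formula given later and should be read as an assumption. The function names and toy data are illustrative only.

```python
def cover(data, conds):
    """Proportion of cases for which all conditions in `conds` are true."""
    conds = frozenset(conds)
    return sum(1 for case in data if conds <= case) / len(data)


def strength(data, lhs, rhs):
    """cover(LHS u RHS) / cover(LHS); assumes cover(LHS) > 0."""
    return cover(data, set(lhs) | set(rhs)) / cover(data, lhs)


def lift(data, lhs, rhs):
    """strength(LHS -> RHS) / cover(RHS); assumed form, see lead-in."""
    return strength(data, lhs, rhs) / cover(data, rhs)


# Toy data, invented for illustration only.
data = [frozenset(c) for c in ({"a", "b", "c"}, {"a", "b"}, {"a", "b", "c"}, {"c"})]
print(cover(data, {"a"}), strength(data, {"a"}, {"b"}), lift(data, {"a"}, {"b"}))
```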

The first pruning rule, used at step 2c, prunes any
condition P for which cover(NewLHS) < minLHScover, where
minLHScover is the minimum allowed LHS coverage. No 
superset of such a LHS can exceed the minimum LHS cov- 
erage as the coverage of a superset of conditions must be no 
larger than the coverage of the original set of conditions. 

The second pruning rule is used at step 2(c)iiB. It prunes
any RHS condition Q for which cover(NewLHS ∪ {Q}) <
minRHScover, where minRHScover = minLHScover ×
minstrength and minstrength is the minimum allowed
value for association strength. This is the minimum allowed
coverage for LHS ∪ RHS for any association. The
justification for this rule mirrors that for the previous.
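
As a worked instance, assuming the OPUS search used the same limits quoted above for the Apriori run (minimum LHS coverage of 10% and minimum strength of 80%), the threshold follows from the definition of strength:

```latex
\[
\mathrm{cover}(LHS \cup RHS)
  = \mathrm{strength}(LHS \rightarrow RHS) \times \mathrm{cover}(LHS)
  \ge \mathit{minstrength} \times \mathit{minLHScover}
  = 0.80 \times 0.10 = 0.08
\]
```

This agrees with the 8% itemset coverage bound noted for the apriori system.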

The next pruning rule is also used at step 2(c)iiB. This rule
utilizes an optimistic assessment of the maximum value of
association strength for a rule with Q as the consequent
in the search space below the current node. First we
determine the maximum number of specialization operations
that may be applied to the current node to reach a node
in the search space below the current node, max_spec =
min(max_LHSsize - |NewLHS|, |SoFar|), where
max_LHSsize is the maximum number of conditions
allowed in a LHS. There may be no more specializations than
there are conditions available to specialize by (|SoFar|).
Nor may there be more specializations than allowed by the
constraint on the number of conditions permitted in a LHS.

Next we determine an upper limit on the maximum reduc- 
tion in coverage that may result from the addition of any 
one condition to the LHS of an association in the search 
space below the current node. All associations in this search 
space cover subsets of the items covered by the associa- 
tion for the current node. Hence, no condition may re- 
move more items from the cover of an association in that 
search space than it removes from the cover of the
association for the current node. Hence max_cover_reduction =
max(cover(LHS) - cover(LHS ∪ {c}) : c ∈ SoFar).

The next step is to determine the minimum coverage for the 
LHS of a rule in the search space below the current node. It 
is not possible for the coverage to be reduced by more than 
max_spec × max_cover_reduction. Nor is it possible for it
to be reduced below the minimum allowed LHS coverage.
Hence, min_cover = max(minLHScover, cover(LHS) -
max_spec × max_cover_reduction).

If min_cover < cover(LHS ∪ {Q}) then the optimistic
assessment of the maximum strength (opt_strength) for an
association with Q as consequent that may lie below the
current node is 1.0 on the basis that the specializations may
remove from the cover of LHS all cases that are not covered
by Q.

Otherwise, opt_strength = cover(LHS ∪ {Q}) / min_cover, 
the result that would be obtained if all reduction in coverage 
removed cases covered by the LHS but not the RHS of the 
associations. 

If opt_strength < minstrength, where minstrength is a 
constraint on the minimum allowed value for strength, then 
the RHS condition Q can be pruned. 
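
The optimistic strength calculation above can be collected into one function, sketched below. The function name, argument names and example figures are assumptions for illustration; `reductions` stands for the list of values cover(LHS) - cover(LHS ∪ {c}), one for each condition c in SoFar, and the minimum LHS coverage is assumed to be greater than zero.

```python
def optimistic_strength(cover_lhs, cover_lhs_q, reductions,
                        max_lhs_size, lhs_size, min_lhs_cover):
    """Upper bound on strength for rules with consequent Q below this node."""
    # At most this many further conditions can be added to the LHS.
    max_spec = min(max_lhs_size - lhs_size, len(reductions))
    # No single condition can remove more coverage than it does here.
    max_cover_reduction = max(reductions, default=0.0)
    # Lowest LHS coverage reachable below this node.
    min_cover = max(min_lhs_cover, cover_lhs - max_spec * max_cover_reduction)
    if min_cover < cover_lhs_q:
        return 1.0               # every case not covered by Q might be removed
    return cover_lhs_q / min_cover


# Invented figures for illustration only.
bound = optimistic_strength(cover_lhs=0.5, cover_lhs_q=0.2,
                            reductions=[0.05, 0.1],
                            max_lhs_size=4, lhs_size=2, min_lhs_cover=0.1)
print(bound)
```

In this example min_cover is about 0.3, so the bound is about 0.67, and Q would be pruned under a minimum strength of 0.8.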

The final pruning rule also applies at step 2(c)iiB. This rule 
determines an optimistic value for lift for associations in the 
search space below the current node that have Q as a
consequent. Lift is maximized when strength is maximized. Thus,
opt_lift = opt_strength / cover({Q}). If opt_lift < min_lift,
where min_lift is the minimum allowed lift, then the RHS
condition Q can be pruned. Note that min_lift could be
a global constraint on associations, but may also be
determined dynamically. In the current application, min_lift was
initialized to zero. However, once the target number of
associations had been added to the table of best association
rules, at step 2(c)iiA, min_lift was progressively updated
to equal the minimum value of lift for a rule in the table.
Hence, as the search progressed and the overall quality of the 
associations in the table improved, more stringent pruning 
could occur. 
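
One way to maintain such a table of best rules, with a lift threshold that tightens as the table fills, is a bounded min-heap. The sketch below is an assumption about how this could be done, not a description of the implementation used here; the class name BestRules and its methods are hypothetical.

```python
import heapq


class BestRules:
    """Keeps the m rules with the highest lift seen so far."""

    def __init__(self, m):
        self.m = m
        self._heap = []                  # min-heap of (lift, rule)

    @property
    def min_lift(self):
        # Zero until the table is full, then the lowest lift in the table.
        return self._heap[0][0] if len(self._heap) >= self.m else 0.0

    def offer(self, lift, rule):
        if len(self._heap) < self.m:
            heapq.heappush(self._heap, (lift, rule))
        elif lift > self._heap[0][0]:
            heapq.heapreplace(self._heap, (lift, rule))   # evict the worst rule


# Invented lift values for illustration only.
table = BestRules(m=2)
table.offer(1.2, "r1")
first = table.min_lift            # table not yet full
table.offer(3.0, "r2")
table.offer(2.0, "r3")            # replaces r1
table.offer(1.5, "r4")            # rejected, below the current threshold
print(first, table.min_lift, sorted(table._heap))
```

Any RHS condition whose optimistic lift falls below table.min_lift can then be pruned, since no rule below the current node could enter the table.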

Using these pruning rules, a total of 384,312 association 
rules were evaluated and only 84,639 distinct antecedents 
considered. This took 48 minutes and 9 seconds CPU time. 
To find the top 100 associations on lift required the explo- 
ration of 204,264 association rules involving 51,678 distinct 
antecedents and took 26 minutes and 49 seconds CPU time. 

4. DISCUSSION 

The above example demonstrates that using OPUS search 
and pruning the search space on the basis of inter-relationships 
between itemsets, it can be feasible to perform efficient as- 
sociation rule analysis on data sets for which the Apriori 
approach is infeasible. Whether or not this is useful de- 
pends, of course, upon whether there are inter-itemset con- 
straints that should be applied for the given association rule 
application. It seems plausible, however, that for many ap- 
plications an upper-limit on the number of association rules 
to be generated will be appropriate, and this can be all that 
is required to enable efficient search. 

Further search constraints, such as SC-Optimality [3], might 
usefully be employed to deliver even greater computational 
efficiency within the OPUS_AR framework. 

That OPUS_AR has wider application than the single dataset
examined herein is demonstrated by the commercial
association rule discovery system Magnum Opus (see footnote 6).
This system, which utilizes the OPUS_AR algorithm, is routinely
employed for commercial association rule discovery from
datasets containing millions of cases each described by tens
of thousands of variables.

Association rule discovery has been firmly rooted in the do- 
main of market basket analysis. However, prior to the pop- 
ularization of market basket analysis, a number of machine 
learning researchers were exploring techniques with many 
similarities to association rule discovery. These researchers 
were exploring the use of complete or extensive search to 
form large rulesets in the belief that such rulesets could pro- 
vide insight or other utility beyond that obtained from the 
small rulesets normally generated by machine learning sys- 
tems [8, 12, 14, 16]. The current work can be viewed as 
a direct descendant of this research effort, extending it by 
utilizing the efficient OPUS search algorithm and by utiliz- 
ing metrics of rule value developed within the field of basket 
analysis. 

5. CONCLUSIONS 

I have presented an algorithm for association rule analy- 
sis based on the efficient OPUS search algorithm. This ap- 
proach is distinguished from the widely utilized Apriori algo- 
rithm by its ability to use inter-relationships between item- 
sets to constrain the number of itemsets that are considered. 
It is distinguished from a number of recent rule mining
algorithms that have been presented as alternatives to Apriori
[4, 3, 10] by exploring associations containing all available
conditions as consequents. However, the approach has the
potential disadvantage, compared with Apriori, that it
requires many more passes through the data. Where the data
can be maintained in main memory this need not be a
serious handicap. The availability of very large memory
computers means that quite sizeable data sets can be retained
in main memory. Where the data cannot be maintained in
main memory, however, this approach to association rule
discovery is unlikely to be feasible.

6 Magnum Opus is distributed by Rulequest Pty Ltd,
http://www.rulequest.com.

A simple example has been used to demonstrate the poten- 
tial advantage of the new approach in some applications. 
Analysis of the Cover Type data set requires generation and 
analysis of 14,567,892 itemsets when the Apriori algorithm 
is utilized, even when the itemset size is restricted to five. In 
contrast, finding the 1000 association rules with the highest 
values of lift within the same constraints required evalua- 
tion of only 677,129 rules and 33,613 distinct antecedents. 
With the implementations employed, the OPUS search was 
completed with all 1000 rules identified in less than 15 CPU 
minutes while it took apriori more than 96 CPU hours just 
to generate the itemsets. This starkly illustrates the
potential advantages of the new approach.